Abstract

Folksonomies - large databases arising from collaborative tagging of
items by independent users - are becoming an increasingly important way
of categorizing information. In these systems users can tag items with free
words, resulting in a tripartite item-tag-user network. Although there are
no prescribed relations between tags, the way users think about the
different categories presumably has some built-in hierarchy, in which more
special concepts are descendants of some more general categories. Several
applications would benefit from the knowledge of this hierarchy. Here
we apply a recent method to check the differences and similarities of
hierarchies resulting from tags given by independent individuals and from
tags given by a centrally managed repository system. The results from
our method showed substantial differences between the lower parts of the
hierarchies and, in contrast, a relatively high similarity at the top of the
hierarchies.

Keywords: tag, hierarchy, ontology reconstruction, folksonomy, knowledge 
mapping 


1 Introduction 

The recent appearance of tags in large online datasets represents a significant
innovation in categorisation. Tags allow multiple categories for each
item, and tagging can be done in a bottom-up approach, in a parallel manner,
by several users simultaneously. This feature allows the tagging of huge
datasets in a reasonable time. In contrast, traditional hierarchical categorisation
typically allows one category per item, and it is done by a few experts, slowing
down the process. Also, available categories are restricted in traditional
expert-made hierarchies, while user-given tags are usually allowed to take any
expression deemed relevant by the user.

Although there is no prescribed structure between the tags, it is a reasonable
assumption that tags are attached to objects according to hidden hierarchical
relations, e.g., “poodle” is usually considered a special case of “dog”.
Consequently, it is an interesting non-trivial task to extract this implicit
hierarchy solely from the co-appearance of tags. Indeed, a number of different
methods have already been proposed in the literature, such as aggregation of
user-defined shallow hierarchies for obtaining a global hierarchy, integration
of information from as many sources as possible, using a probabilistic criterion
to define parent-child relations [10], applying pairwise similarities to
centrality-ordered tags, or building up the hierarchy from the bottom up based
on the z-score between the tags [12].

Besides the organisation of different keywords or categories describing a given
topic, signs of hierarchy are prevalent in a very wide range of systems. Among
others, the transcriptional regulatory network of Escherichia coli, the
dominant-subordinate hierarchy among crayfish [14], the leader-follower network
of pigeon flocks, the rhesus macaque kingdoms, neural networks, technological
networks [18], social interactions, urban planning [22, 23], ecological
systems [21] [25], and evolution all show signs of hierarchical
organisation. Different approaches were introduced to uncover hierarchy
in networks, including the introduction of hierarchy measures,
statistical inference of hierarchy [32] and construction of hierarchical network
models.

Here we analyse the hierarchies obtained for the scientific keywords from the
Web of Science [51] by applying a recent generalisation, presented in [35], of
the method given in Ref. [12]. We treat the set of author-given tags and the
set of repository-given tags separately, resulting in two alternative
hierarchies. These are compared to each other and also to the 3-level
classification of categories given by the Web of Science. The paper is
organised as follows: in Sect. 2 we introduce the tag hierarchy construction
methodology and describe the datasets to which it is applied. The obtained
hierarchies are presented in Sect. 3, while the results are discussed in
Sect. 4.

2 Materials and Methods 

2.1 Tag hierarchy construction 

In order to obtain a tag hierarchy, we follow the method described in [12]
and [35], for which a quick overview is provided here.

Given a set of objects and each object having a set of tags, the goal is to 
construct a hierarchy, i.e., a directed acyclic graph (DAG) of the tags, where 
links are directed from more general concepts to more special ones. Our method 
constructs a hierarchy in two steps: first the tags are ordered, defining which 
tag should be placed higher in the hierarchy and which lower, then for each tag 
an appropriate parent is chosen. Note that in the second step we allow
more than one parent to be chosen for a tag, hence the resulting hierarchy can
be more complex than a simple tree.

For the reader who is not familiar with the method [12], we briefly summarize
the main steps below. First, we rank the tags according to the eigenvector
centrality of the tag co-appearance graph. Nodes in the co-appearance graph
correspond to the tags, and links represent the co-appearances of the tags on
the same object. The weights of the links are given by the number of
co-appearances. However, when calculating the eigenvector centrality, links
having a z-score below a certain threshold value are neglected. The z-score is
calculated as the observed number of objects where the two tags co-appear minus
the expected number of co-occurrences when tags are randomly shuffled. The
z-score is normalized by the standard deviation of random co-occurrences,


$$ z_{ij} = \frac{c_{ij} - \mu_{ij}}{\sigma_{ij}}, \qquad (1) $$

where $c_{ij}$ is the number of times tags $i$ and $j$ co-appear, and
$\mu_{ij}$ and $\sigma_{ij}$ are the expected value and standard deviation,
respectively, for randomly reshuffled tags.
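
As an illustration of the z-score defined above and of the ranking step, the
following Python sketch computes pairwise z-scores and a thresholded
eigenvector centrality for a small set of tagged objects. It is not the
implementation used in this paper. Two choices are our own assumptions: the
random reshuffling of tags is modelled by a hypergeometric null (which fixes
the expected value and standard deviation in closed form), and the centrality
is obtained by a shifted power iteration. The names `pair_statistics`,
`eigenvector_centrality` and `z_threshold` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pair_statistics(objects):
    """Return (counts, z) for all tag pairs that co-appear at least once.

    `objects` is a list of tag sets, one per object.
    ASSUMPTION: random reshuffling is modelled hypergeometrically, i.e. each
    tag keeps its number of objects and is placed on objects at random.
    """
    n = len(objects)
    freq = Counter(tag for tags in objects for tag in set(tags))
    counts = Counter()
    for tags in objects:
        for a, b in combinations(sorted(set(tags)), 2):
            counts[(a, b)] += 1

    z = {}
    for (a, b), c in counts.items():
        na, nb = freq[a], freq[b]
        mu = na * nb / n  # expected number of co-appearances
        var = na * nb * (n - na) * (n - nb) / (n * n * (n - 1)) if n > 1 else 0.0
        if var > 0:
            z[(a, b)] = (c - mu) / math.sqrt(var)
    return counts, z


def eigenvector_centrality(counts, z, z_threshold, iterations=200):
    """Eigenvector centrality of the co-appearance graph.

    Link weights are co-appearance counts; links whose z-score is below
    `z_threshold` are neglected. Uses power iteration on A + I (the identity
    shift avoids oscillation on bipartite-like graphs).
    """
    neighbours = {}
    for (a, b), c in counts.items():
        if z.get((a, b), float("-inf")) >= z_threshold:
            neighbours.setdefault(a, {})[b] = c
            neighbours.setdefault(b, {})[a] = c

    x = {tag: 1.0 for tag in neighbours}
    for _ in range(iterations):
        new = {
            tag: x[tag] + sum(w * x[other] for other, w in nbrs.items())
            for tag, nbrs in neighbours.items()
        }
        norm = math.sqrt(sum(v * v for v in new.values()))
        x = {tag: v / norm for tag, v in new.items()}
    return x
```

Tags left without any link after thresholding do not appear in the returned
dictionary; how such tags are treated is a design choice not fixed by this
sketch.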

In the second step the hierarchy is built according to a bottom-up approach,
i.e., we look for parents for each tag i in ascending order of eigenvector
centrality. We choose a tag to be the first parent of i when it has higher
eigenvector centrality than i and has the maximal score among the possible
parents. The score here is the sum of the z-scores of the links between the
candidate parent and the descendants of i, plus the z-score of the link between
the candidate parent and i itself. Note that by aggregating
the descendants’ z-scores, we take into account much more information than
any pairwise similarity metric can provide. Finally, we allow further parents if
they have links to i with at least as high a z-score as the first parent.
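
The parent selection described above can be sketched as follows. This is a
minimal illustration and not the implementation used in this paper. The
function name `build_hierarchy` and its input format are hypothetical, and we
additionally assume that a candidate must reach a positive score to become a
parent (otherwise the tag is left as a root) and that a missing link counts as
a z-score of zero in the sum.

```python
def build_hierarchy(centrality, z):
    """Return {tag: [parents]} built bottom-up from centralities and z-scores.

    `centrality` maps each tag to its eigenvector centrality.
    `z` maps unordered tag pairs (stored as tuples) to the z-score of the link.
    """
    def link(a, b):
        # z-score of the link between a and b, or None if there is no link
        return z.get((a, b), z.get((b, a)))

    order = sorted(centrality, key=centrality.get)  # ascending centrality
    parents = {tag: [] for tag in order}
    children = {tag: set() for tag in order}

    def descendants(tag):
        seen, stack = set(), [tag]
        while stack:
            for child in children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen

    for tag in order:
        candidates = [c for c in order if centrality[c] > centrality[tag]]
        group = descendants(tag) | {tag}
        # score = summed z-scores towards the tag and all of its descendants
        scores = {c: sum(link(c, m) or 0.0 for m in group) for c in candidates}
        scores = {c: s for c, s in scores.items() if s > 0}  # ASSUMPTION
        if not scores:
            continue  # no suitable parent: the tag stays a root

        first = max(scores, key=scores.get)
        z_first = link(first, tag) or 0.0
        chosen = [first]
        # further parents: direct link to the tag at least as strong as the first
        for c in candidates:
            z_c = link(c, tag)
            if c != first and z_c is not None and z_c >= z_first:
                chosen.append(c)

        parents[tag] = chosen
        for p in chosen:
            children[p].add(tag)
    return parents
```

Because every parent has strictly higher centrality than its child, the
result is acyclic by construction. The candidate scan is quadratic in the
number of tags, so a dataset of the size studied here would need the
candidates to be restricted, for example to tags linked to the group.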


2.2 Dataset 

We study the keywords of scientific papers between 1975 and 2011 obtained from
the Web of Science. The dataset contains 35 371 214 papers, which are tagged
by three types of tags. The first type (heading) gives a very broad
categorisation of the paper; there are only 5 tags of this type: Arts &
Humanities, Life Sciences & Biomedicine, Multidisciplinary Science & Technology,
Physical Sciences and Social Sciences. The second type (category) has
251 more fine-grained scientific areas like Chemistry, Analytical or Engineering,
Geological. Tags of the third type are chosen from two sets of specific phrases.
One set is composed of the keywords which originated from the authors of
the papers. The other set is given by the Web of Science service, and is
intended to be complementary to the author-given keywords. We will refer to the
first keywords as authorkeywords and to the others as woskeywords. There are a
huge number of third-type tags: the woskeywords set contains 2 245 143 phrases
and the authorkeywords set contains 6 891 089, which are very specific, like
Zygapophyseal arthritis or H-3 -R-alpha-methylhistamine binding. Although these
keywords are intended to be complementary on the level of individual papers,
883 836 of them still appear both in the set of woskeywords and in the set of
authorkeywords. Finally, we note that the Web of Science does not define any
hierarchical relations between the tags, i.e., the ancestors or descendants of
the tags are not given in the data set; only the categorization into the three
major types is provided.


3 Results 


The aim here is to apply the methodology of Sec. 2.1 to the data described in
Sec. 2.2 in order to study the differences and similarities of hierarchies resulting
from tags given by independent individuals and from tags given by a repository.
In the first case the input of the hierarchy reconstruction is given by the
heading, category and authorkeyword tags, while in the second case by the
heading, category and woskeyword tags. Note that the general and intermediately
general type tags are common to both datasets, and these tags are given by the
repository management system. The difference between the independent tagging
and the centrally managed tagging comes from the most numerous third-level
tags. We compare the hierarchies of the two taggings below. First we compare
the uppermost part of the reconstructed DAGs. Then the hierarchy level
occupation statistics of the DAGs are compared for each tag type. Finally the
horizontal (branching) structures of the DAGs are analysed.

In the reconstructed hierarchies obtained from our method both DAGs had 
4 dominant roots at the highest level of the hierarchy, being the ancestors 
of 99.8%-99.9% of the available tags. The DAGs contain several other
non-dominant roots, corresponding to tiny connected components which cover only
0.1%-0.2% of the tags. The four dominant roots coincide with the heading type 
tags except Multidisciplinary Science & Technology, which appears as a 
child of Physical Sciences. 

Next, we compare the vertical structures of the two DAGs by analysing the
hierarchy level distribution of the different tag types. A technical difficulty
arises from the fact that a tag may descend from several roots, so it can have
several level values depending on the root from which it is counted. Here we
classify tags into hierarchy levels according to their closest root, i.e., from
the possible level numbers we assign to each tag the one that places it highest
in the hierarchy. The resulting level distributions are shown in Fig. 1. They
indicate that the position in
the DAG correlates strongly with the heading-category-(author/wos)keyword
classification, i.e., the reconstruction is consistent with the a priori classification
of the tags in this respect. However, it is interesting to note that while tags
of different types mostly appear below each other in the expected order, tags
of the same type also appear below each other - the reconstruction finds
structure within the types.
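
The level assignment by closest root can be illustrated with a multi-source
breadth-first search. The sketch below is ours and makes two assumptions: the
DAG is stored as a hypothetical `{tag: [parents]}` dictionary in which every
tag appears as a key, and roots are placed at level 1, as in Fig. 1.

```python
from collections import deque


def levels_from_closest_root(parents):
    """Assign each tag the level given by its closest root (roots are level 1).

    `parents` maps every tag to the list of its parents; tags with an empty
    list are roots.
    """
    children = {tag: [] for tag in parents}
    for tag, tag_parents in parents.items():
        for p in tag_parents:
            children[p].append(tag)

    roots = [tag for tag, tag_parents in parents.items() if not tag_parents]
    level = {root: 1 for root in roots}
    queue = deque(roots)
    # Breadth-first search from all roots at once: the first time a tag is
    # reached, it is reached along a shortest path from its closest root.
    while queue:
        tag = queue.popleft()
        for child in children[tag]:
            if child not in level:
                level[child] = level[tag] + 1
                queue.append(child)
    return level
```

To restrict the statistics to the descendants of the dominant roots only, the
same search could be started from those roots alone.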

The third aspect is the horizontal similarity of the DAGs. Here we analyse
whether common members of the DAGs are in similar horizontal positions, i.e.,
whether they have similar descendant subgraphs. Since the DAGs are constructed
from the same heading and category type tags and the two different keyword
tags, we compare the horizontal structure of the two DAGs in two ways: i) first
we restrict the analysis to those tags that are common to the two DAGs
(heading, category and common keywords); ii) secondly we restrict the analysis
even more, considering only the heading and category type tags, which are
common by definition of the DAGs.
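
As a rough illustration of what "similar descendant subgraphs" means, the
sketch below compares, tag by tag, the descendant sets of the common tags in
two DAGs using the Jaccard index. This is a deliberately simple stand-in and
not the similarity measure used for the results reported in this section. The
function names and the `{tag: [parents]}` input format are hypothetical, and
skipping tags that have no common descendants in either DAG is our assumption.

```python
def descendant_sets(parents, keep):
    """Return {tag: set of descendants}, restricted to the tags in `keep`."""
    children = {tag: set() for tag in parents}
    for tag, tag_parents in parents.items():
        for p in tag_parents:
            children[p].add(tag)

    result = {}
    for tag in keep:
        seen, stack = set(), [tag]
        while stack:
            for child in children[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        result[tag] = seen & keep
    return result


def mean_descendant_jaccard(parents_a, parents_b):
    """Average Jaccard index of descendant sets over the common tags."""
    common = set(parents_a) & set(parents_b)
    desc_a = descendant_sets(parents_a, common)
    desc_b = descendant_sets(parents_b, common)

    values = []
    for tag in common:
        union = desc_a[tag] | desc_b[tag]
        if union:  # ASSUMPTION: tags that are leaves in both DAGs are skipped
            values.append(len(desc_a[tag] & desc_b[tag]) / len(union))
    return sum(values) / len(values) if values else 1.0
```

A value of 1 means that every common tag has the same common descendants in
both DAGs, and a value near 0 means that the branches below the common tags
are almost entirely different.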

For the first case, where we compare the horizontal position of the common
keywords/categories/headings of the two DAGs, we calculated the linearised
mutual information-based similarity of [12]. The result shows a very strong
dissimilarity, with a mutual information of 0.03.¹ A sample of the DAGs is
shown in Fig. 2, around “vegetation response”. In both DAGs, related tags
appear below the chosen tag; however, according to Fig. 2, descendants in one
DAG differ from descendants in the other. These results are in accordance with
the complementary nature of the authorkeywords and woskeywords.

1 The linearised mutual information ranges from 0 to 1. 


Figure 1: Level-wise ratio of tags, for the 3 tag types. Left panel is for the 
authorkeyword DAG, right panel for the woskeyword DAG. The distribution is 
calculated for the tags that are members in at least one of the descendant sets 
of the 4 dominant roots. Roots are at level 1. 


If we restrict the calculation of the mutual information to the heading and
category tags only, the similarity jumps to 0.89, showing that the relations
between general tags are quite robust. This is notable because the hierarchies
are built bottom-up, and the bottom parts are very different. Samples of these
reduced DAGs are visualised in Fig. 3. They display a few branches below Life
Sciences & Biomedicine, like Biochemistry & Molecular Biology, Cardiac &
Cardiovascular Systems or Plant Sciences. The two sub-figures show that
Neurosciences, Plant Sciences, Biophysics and Agronomy have more children in
the woskeyword DAG, while Hematology is also connected to Transplantation in
the authorkeyword DAG.

Note that the reconstruction strongly depends on the descendants of each tag,
especially for tags having several descendants, thus the difference between the
authorkeywords and woskeywords could have led to a very different structure at
the top of the DAG [12]. The very high similarity at the top of the hierarchy,
compared to the low similarity for the first case, indicates that the
differences between the authorkeywords and the woskeywords result in
differences on the low levels of the hierarchy, while these differences do not
propagate to the highest levels.


4 Discussion 

Tag-based categorisation of large online datasets is becoming increasingly
widespread. Tags allow free word tagging, multiple categories for items and
user-based processing in a parallel manner instead of centralised expert-based
processing. Although the tags have no predefined relations, it is reasonable to
assume that users think, to some extent, in terms of hierarchical relations
between tags, i.e., using some tags as special cases of other, more general tags.

Here we applied a recently introduced hierarchy construction method to
keywords of scientific papers from the Web of Science. Tags were pre-organised
by the Web of Science into 3 types, from the very general to the very special. For
the most special type, 2 different sets of keywords were obtained, author-given
and repository-given. Accordingly, two different hierarchies were constructed,
each time using one of these sets as the special type, accompanied by the more
general tags.


Figure 2: Samples from the reduced DAGs (to heading, category and common
keywords) with the woskeywords (top) and authorkeywords (bottom). Node
sizes show the number of descendants in the reduced DAGs, on a logarithmic
scale.


First, the structures of the obtained hierarchies were compared to the 3
predefined tag types. Good correspondence was found here. For the most general
type, 4 out of the 5 member tags appeared as level 1 roots in the constructed
hierarchies (Arts & Humanities, Life Sciences & Biomedicine, Physical
Sciences and Social Sciences), the fifth one being an immediate child of
one of them (Multidisciplinary Science & Technology). The intermediate
type tags populated the next levels in the hierarchies, and members of the most
specific type were at the lowest levels. An interesting observation is that the
tags were organised into significantly more levels than three, indicating that
there is structure within the predefined types.


Figure 3: Samples from the reduced DAGs (to heading and category) of the
woskeywords (top) and authorkeywords (bottom) based reconstructions.
Reduction left only the 256 heading and category tags. Node sizes show the
number of descendants in the reduced DAGs, on a logarithmic scale.

Second, the two constructed hierarchies, using two different sets of special
keywords, were compared to each other. The hierarchies were reduced to the
tags common to both of them, in order to make direct comparison possible. It
was found that the organisation of the tags is very different, their similarity
scoring 0.03 on a [0,1] scale. This is in accordance with their purpose, i.e.,
for each individual paper woskeywords are intended to be complementary to the
authorkeywords [36]. On the other hand, when reducing the hierarchies only to
the general and intermediately general type tags, a much higher similarity of
0.89 was obtained, in spite of the fact that the hierarchies were constructed
bottom-up, so that different lower levels could have resulted in different
higher levels. Interestingly, while the lower parts of the hierarchies were
different, the more general tags were organised in a highly similar way.